-
Notifications
You must be signed in to change notification settings - Fork 293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(wren-ai-service): Embed the SQL pairs in MDL #1082
Conversation
WalkthroughThis pull request introduces significant changes to the Wren AI service, focusing on SQL pairs indexing and processing. The modifications span multiple files, introducing a new Changes
Sequence DiagramsequenceDiagram
participant User
participant SqlPairs
participant DocumentConverter
participant Embedder
participant DocumentStore
participant Cleaner
User->>SqlPairs: Run with MDL string
SqlPairs->>DocumentConverter: Convert SQL pairs to documents
DocumentConverter-->>SqlPairs: Return documents
SqlPairs->>Embedder: Embed documents
Embedder-->>SqlPairs: Return embedded documents
SqlPairs->>Cleaner: Clean SQL pairs
Cleaner-->>SqlPairs: Return cleaned pairs
SqlPairs->>DocumentStore: Write documents
Poem
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
7bd856c
to
75fdd82
Compare
75fdd82
to
8d9a9d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🔭 Outside diff range comments (3)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (2)
Line range hint
16-23
: Maintain consistent parameter naming across the codebase.The parameter
id
should be renamed toproject_id
for consistency with the changes in other files and the call tosql_pairs_cleaner.run
.async def delete_sql_pairs( sql_pairs_cleaner: SqlPairsCleaner, sql_pair_ids: List[str], - id: Optional[str] = None, + project_id: Optional[str] = None, ) -> None: - return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=id) + return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=project_id)
Line range hint
44-54
: Maintain consistent parameter naming in the pipeline class.The
run
method still usesid
parameter while the rest of the codebase has moved toproject_id
.async def run( - self, sql_pair_ids: List[str], id: Optional[str] = None + self, sql_pair_ids: List[str], project_id: Optional[str] = None ) -> Dict[str, Any]: logger.info("SQL Pairs Deletion pipeline is running...") return await self._pipe.execute( ["delete_sql_pairs"], inputs={ "sql_pair_ids": sql_pair_ids, - "id": id or "", + "project_id": project_id or "", **self._components, }, )wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)
Line range hint
37-37
: Update test function name for consistency.The test function name still contains "preparation" while others have been updated to "indexing".
-async def test_sql_pairs_preparation_saving_to_document_store_with_multiple_project_ids(): +async def test_sql_pairs_indexing_saving_to_document_store_with_multiple_project_ids():
🧹 Nitpick comments (12)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)
184-188
: Avoid hard-coded test data in the main blockIn the
__main__
block used for a dry run, themdl_str
is hard-coded with sample data. While this is acceptable for testing, consider acceptingmdl_str
as a command-line argument or reading from a file to make the script more flexible.wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)
20-25
: Use consistent project IDs in testsIn the test, you use
project_id="fake-id"
when runningsql_pairs.run
, but later useid="fake-id-2"
when runningsql_pairs_deletion.run
. This might lead to confusion as the IDs do not match, and could affect the accuracy of the test.Consider using the same
project_id
throughout the test to ensure consistency:- await sql_pairs_deletion.run(id="fake-id-2", sql_pair_ids=["1", "2"]) + await sql_pairs_deletion.run(id="fake-id", sql_pair_ids=["1", "2"])wren-ai-service/tools/mdl_to_str.py (1)
24-29
: Simplify string escaping using 'json.dumps'The manual replacement of backslashes and double quotes can be error-prone. Consider using the built-in
json.dumps
function with theensure_ascii=False
andindent=None
parameters for proper string serialization.Replace the
to_str
function implementation with:import json def to_str(mdl: dict) -> str: return json.dumps(mdl, ensure_ascii=False)wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1)
29-29
: Track technical debt with an issue.The TODO comment suggests removing this pipeline. This should be tracked in the issue tracker for proper follow-up.
Would you like me to create a GitHub issue to track this technical debt?
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)
Line range hint
10-34
: Reduce test code duplication.Both test functions share similar setup and assertion logic. Consider extracting common setup into fixtures and shared assertion helpers.
@pytest.fixture async def sql_pairs_store(): pipe_components = generate_components(settings.components) document_store_provider = pipe_components["sql_pairs_indexing"]["document_store_provider"] store = document_store_provider.get_store( dataset_name="sql_pairs", recreate_index=True, ) return store, pipe_components async def assert_documents(store, expected_count, project_id=None): assert await store.count_documents() == expected_count filters = None if project_id: filters = { "operator": "AND", "conditions": [ {"field": "project_id", "operator": "==", "value": project_id}, ], } documents = store.filter_documents(filters=filters) for document in documents: assert document.content, "content should not be empty" assert document.meta, "meta should not be empty" assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta" assert document.meta.get("sql"), "sql should be in meta"Also applies to: 37-71
24-25
: Use test constants for better maintainability.Hard-coded MDL strings and test data paths should be moved to constants or test fixtures.
TEST_DATA = { 'MDL_STR': '{"models": [{"properties": {"boilerplate": "test"}}]}', 'SQL_PAIRS_PATH': "tests/data/pairs.json" }Also applies to: 52-53, 57-58
wren-ai-service/Justfile (2)
73-74
: Add documentation for the new target.The
mdl-to-str
target lacks documentation about its purpose and usage. Consider adding a comment explaining what it does and how to use it.+# Convert MDL JSON file to a properly escaped string format +# Usage: just mdl-to-str path/to/mdl.json mdl-to-str mdl_path="": poetry run python tools/mdl_to_str.py -p {{mdl_path}}
73-74
: Add parameter validation.The target should validate that
mdl_path
is provided and exists before running the Python script.mdl-to-str mdl_path="": + #!/usr/bin/env bash + if [ -z "{{mdl_path}}" ]; then + echo "Error: mdl_path is required" + exit 1 + fi + if [ ! -f "{{mdl_path}}" ]; then + echo "Error: File {{mdl_path}} does not exist" + exit 1 + fi poetry run python tools/mdl_to_str.py -p {{mdl_path}}wren-ai-service/src/config.py (1)
57-57
: Add documentation for the new configuration field.The
sql_pairs_path
field should be documented to explain its purpose, expected format, and usage.+ # Path to the JSON file containing SQL pairs for embedding in MDL + # The file should contain an array of SQL pair objects sql_pairs_path: str = Field(default="pairs.json")wren-ai-service/src/pipelines/generation/sql_generation.py (1)
35-43
: Consider adding sample validation.While the SQL samples integration looks good, consider adding validation for the sample format to ensure each sample contains both
question
andsql
fields.@observe(capture_input=False) def prompt( query: str, documents: List[str], exclude: List[Dict], text_to_sql_rules: str, prompt_builder: PromptBuilder, configuration: Configuration | None = None, sql_samples: List[Dict] | None = None, ) -> dict: + if sql_samples: + for sample in sql_samples: + if not all(key in sample for key in ['question', 'sql']): + raise ValueError("Each SQL sample must contain 'question' and 'sql' fields") return prompt_builder.run( query=query, documents=documents, exclude=exclude, text_to_sql_rules=text_to_sql_rules, instructions=construct_instructions(configuration), sql_samples=sql_samples, current_time=configuration.show_current_time(), )wren-ai-service/src/globals.py (1)
227-228
: Verify pipeline component naming consistency.The pipeline key is "sql_pairs_preparation" but uses components from "sql_pairs_indexing". Consider aligning these names for better maintainability.
- "sql_pairs_preparation": indexing.SqlPairs( - **pipe_components["sql_pairs_indexing"], + "sql_pairs_indexing": indexing.SqlPairs( + **pipe_components["sql_pairs_indexing"],deployment/kustomizations/base/cm.yaml (1)
185-188
: Consider documenting the pairs.json structure.The pairs.json is initialized with an empty sample array. Consider adding:
- Documentation about the expected structure of the array
- Example entries to illustrate the format
pairs.json: | { + # Array of SQL pairs for indexing + # Example format: + # "sample": [ + # { + # "question": "What is the total sales?", + # "sql": "SELECT SUM(sales) FROM transactions" + # } + # ] "sample": [] }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
deployment/kustomizations/base/cm.yaml
(2 hunks)deployment/kustomizations/base/deploy-wren-ai-service.yaml
(2 hunks)docker/config.example.yaml
(1 hunks)docker/docker-compose-dev.yaml
(1 hunks)docker/docker-compose.yaml
(1 hunks)wren-ai-service/Justfile
(1 hunks)wren-ai-service/src/config.py
(1 hunks)wren-ai-service/src/globals.py
(2 hunks)wren-ai-service/src/pipelines/generation/sql_generation.py
(1 hunks)wren-ai-service/src/pipelines/generation/utils/sql.py
(1 hunks)wren-ai-service/src/pipelines/indexing/__init__.py
(2 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
(1 hunks)wren-ai-service/src/web/v1/services/semantics_preparation.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
(2 hunks)wren-ai-service/tests/data/pairs.json
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py
(2 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py
(1 hunks)wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
(2 hunks)wren-ai-service/tools/config/config.example.yaml
(1 hunks)wren-ai-service/tools/config/config.full.yaml
(1 hunks)wren-ai-service/tools/mdl_to_str.py
(1 hunks)
💤 Files with no reviewable changes (1)
- wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
✅ Files skipped from review due to trivial changes (3)
- wren-ai-service/tests/data/pairs.json
- wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
- wren-ai-service/src/pipelines/generation/utils/sql.py
🔇 Additional comments (18)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)
73-76
: Ensure all SQL pairs have valid 'id' fieldsWhen creating
SqlPair
instances,pair.get("id")
may returnNone
if the 'id' key is missing in the pair. Sinceid
is a required field of typestr
in theSqlPair
model, this might raise validation errors.Please ensure that all entries in
external_pairs
have an 'id' field. If 'id' might be missing, consider providing a default value or handling missing IDs appropriately.wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)
5-5
: Updated imports reflect changes in the pipeline structureThe import statement correctly replaces
SqlPairsPreparation
withSqlPairs
, aligning with the updated pipeline naming conventions.
29-32
: Verify deletion operation in testAfter running
sql_pairs_deletion.run
, the test asserts that the document count in the store has decreased. Ensure that the test correctly verifies the deletion operation by checking the actual contents of the store.As a follow-up, you can enhance the test by retrieving the remaining documents and asserting that the deleted documents are no longer present.
wren-ai-service/src/pipelines/indexing/__init__.py (3)
96-98
: LGTM! Parameter rename improves clarity.The parameter rename from
id
toproject_id
enhances code readability and maintains consistency with the project's naming conventions.
106-108
: LGTM! Filter condition is correctly updated.The filter condition correctly uses the renamed
project_id
parameter, maintaining the logical integrity of the document filtering.
117-117
: LGTM! Module exports are properly updated.The import and
__all__
list changes correctly reflect the refactoring fromSqlPairsPreparation
toSqlPairs
.Also applies to: 126-126
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1)
28-28
: LGTM! Key rename improves semantic clarity.The change from
"summary"
to"question"
better represents the actual content being formatted and aligns with the SQL pairs domain terminology.wren-ai-service/src/web/v1/services/semantics_preparation.py (1)
82-87
: LGTM! SQL pairs pipeline task properly integrated.The addition of
"sql_pairs"
to the pipeline tasks ensures proper integration with the semantics preparation process while maintaining concurrent execution throughasyncio.gather()
.wren-ai-service/src/pipelines/generation/sql_generation.py (1)
35-43
: LGTM! SQL samples section enhances LLM guidance.The addition of the SQL samples section in the prompt template provides valuable examples to guide the LLM in generating more accurate SQL queries.
Consider the impact on token limits when many samples are provided. Run this script to analyze the template size:
✅ Verification successful
Token limits are properly handled by the system architecture
The system is well-designed with built-in safeguards:
- Uses GPT-4's standard 4096 token limit
- Retrieves relevant SQL samples through a dedicated pipeline
- Conditionally includes samples in the template
- Has default sample limits in place
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Calculate approximate token count for the template echo 'Analyzing template size...' rg -A 10 'sql_generation_user_prompt_template = """' | wc -wLength of output: 128
Script:
#!/bin/bash # Get the complete template content echo "=== Complete Template ===" rg -U "sql_generation_user_prompt_template = \"\"\".*?\"\"\"" -A 50 echo -e "\n=== Token Limit Configurations ===" rg -i "token.*limit|max.*token" echo -e "\n=== Sample Handling Logic ===" rg -l "sql_samples" | xargs ast-grep --pattern 'def $FUNC($$$) { $$$ sql_samples $$$ }'Length of output: 1685
Script:
#!/bin/bash # Look for sample selection/limiting logic echo "=== Sample Selection Logic ===" rg -A 10 "def.*sql_samples" echo -e "\n=== Sample List Creation ===" rg -B 5 -A 5 "sql_samples.*=.*"Length of output: 9163
Script:
#!/bin/bash # Look for sample limit configurations echo "=== Sample Limit Configurations ===" rg -i "sample.*limit|limit.*sample|top.*sample|max.*sample" echo -e "\n=== SQL Pairs Retrieval Logic ===" rg -A 10 "sql_pairs_retrieval.*run"Length of output: 2337
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2)
94-94
: Address TODO comment before merging.The comment indicates that the SQL pairs preparation functionality needs proper implementation. This should be addressed before merging to ensure complete functionality.
Would you like me to help implement the proper SQL pairs preparation functionality or create a GitHub issue to track this task?
96-97
: Verify parameter renaming impact.The parameter rename from
id
toproject_id
improves naming consistency, but we should verify all callers have been updated.✅ Verification successful
Parameter rename verification complete - all callers updated
The parameter rename from 'id' to 'project_id' has been properly implemented across the codebase. All relevant callers have been updated to use the new parameter name. The remaining uses of 'id' in the codebase are for different purposes (like sql_pair.id or sql_pairs_preparation_id) and not related to this parameter rename.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for any remaining uses of the old parameter name rg "id=" -A 5 -B 5Length of output: 117072
Script:
#!/bin/bash # Search for function calls with id parameter rg "id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3 # Search for function calls with project_id parameter rg "project_id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3 # Use ast-grep to find function calls with id parameter ast-grep --pattern 'id=$_' -l py # Use ast-grep to find function calls with project_id parameter ast-grep --pattern 'project_id=$_' -l pyLength of output: 1416
Script:
#!/bin/bash # Search for function calls with id parameter in sql_pairs related files rg ".*\(.*id=.*\)" --type py -g "*sql_pairs*.py" -A 3 -B 3 # Search for function calls with project_id parameter in sql_pairs related files rg ".*\(.*project_id=.*\)" --type py -g "*sql_pairs*.py" -A 3 -B 3 # Search for any remaining uses of id parameter in the codebase rg "\bid=(?![\w-])" --type py -g "!*test*" -A 3 -B 3 # Search for any remaining uses of project_id parameter in the codebase rg "\bproject_id=(?![\w-])" --type py -g "!*test*" -A 3 -B 3Length of output: 10090
wren-ai-service/src/globals.py (1)
79-82
: LGTM! Well-structured pipeline integration.The SQL pairs pipeline component is properly integrated with the settings and follows the established pattern for pipeline configuration.
docker/docker-compose-dev.yaml (1)
48-48
: LGTM! Proper configuration for SQL pairs.The environment variable and volume mapping for SQL pairs are correctly configured and follow the established pattern.
Also applies to: 53-53
deployment/kustomizations/base/deploy-wren-ai-service.yaml (1)
66-67
: LGTM! Consistent deployment configuration.The environment variable and config-volume configuration for SQL pairs are properly set up and maintain consistency with the docker-compose configuration.
Also applies to: 77-78
docker/docker-compose.yaml (1)
59-59
: LGTM! Environment variable and volume mapping for SQL pairs configuration.The configuration follows Docker best practices by properly mapping the configuration file and setting up the corresponding environment variable.
Also applies to: 64-64
docker/config.example.yaml (1)
Line range hint
107-115
: Verify the pipeline dependencies and configuration.The SQL pairs pipeline components are well-structured with appropriate dependencies. However, let's verify the pipeline configuration usage across the codebase.
✅ Verification successful
Old references to
sql_pairs_preparation
exist but don't impact functionalityThe pipeline components are correctly configured and used throughout the codebase. The references to
sql_pairs_preparation
only exist in test files and service names, not in the actual pipeline configuration or usage. No changes are required.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the pipeline configuration usage # Check for pipeline component references rg -A 3 "sql_pairs_(indexing|deletion|retrieval)" # Check for any potential old references to sql_pairs_preparation rg "sql_pairs_preparation"Length of output: 20474
🧰 Tools
🪛 yamllint (1.35.1)
[error] 112-112: trailing spaces
(trailing-spaces)
wren-ai-service/tools/config/config.example.yaml (1)
Line range hint
125-133
: Configuration is consistent with docker/config.example.yaml🧰 Tools
🪛 yamllint (1.35.1)
[error] 130-130: trailing spaces
(trailing-spaces)
wren-ai-service/tools/config/config.full.yaml (1)
Line range hint
144-152
: Verify the removal of LLM dependency from sql_pairs_indexing.The LLM dependency has been removed from the sql_pairs_indexing pipeline while being retained in sql_pairs_retrieval. Please confirm if this architectural change is intentional and doesn't impact the indexing functionality.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 149-149: trailing spaces
(trailing-spaces)
8d9a9d4
to
b165de2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)
20-25
: Consider using a test fixture for the pairs.json path.Instead of hardcoding the test data path, consider:
- Moving it to a test fixture or constant
- Using a temporary file for isolation
This would make the tests more maintainable and prevent potential issues with file locations in different environments.+# At the top of the file +TEST_SQL_PAIRS_PATH = "tests/data/pairs.json" - **pipe_components["sql_pairs_indexing"], sql_pairs_path="tests/data/pairs.json" + **pipe_components["sql_pairs_indexing"], sql_pairs_path=TEST_SQL_PAIRS_PATHwren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)
Line range hint
10-34
: Enhance test coverage for SQL pairs indexing.While the test verifies basic document storage, consider adding assertions for:
- MDL string integration
- Project ID in metadata
- Specific content from pairs.json
assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta" assert document.meta.get("sql"), "sql should be in meta" + assert document.meta.get("project_id") == "fake-id", "project_id should match" + # Verify MDL integration + assert document.meta.get("mdl") is not None, "mdl should be in meta"
Line range hint
51-71
: Enhance multi-project ID test coverage.The test could be more thorough in verifying the indexed documents:
- Add assertions for both project IDs
- Verify that documents are correctly associated with their respective projects
- Check that MDL strings are properly stored
assert len(documents) == 2 + # Verify documents for second project + documents_2 = store.filter_documents( + filters={ + "operator": "AND", + "conditions": [ + {"field": "project_id", "operator": "==", "value": "fake-id-2"}, + ], + } + ) + assert len(documents_2) == 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
deployment/kustomizations/base/cm.yaml
(2 hunks)deployment/kustomizations/base/deploy-wren-ai-service.yaml
(2 hunks)docker/config.example.yaml
(1 hunks)docker/docker-compose-dev.yaml
(1 hunks)docker/docker-compose.yaml
(1 hunks)wren-ai-service/Justfile
(1 hunks)wren-ai-service/src/config.py
(1 hunks)wren-ai-service/src/globals.py
(2 hunks)wren-ai-service/src/pipelines/generation/sql_generation.py
(1 hunks)wren-ai-service/src/pipelines/generation/utils/sql.py
(1 hunks)wren-ai-service/src/pipelines/indexing/__init__.py
(2 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
(1 hunks)wren-ai-service/src/web/v1/services/semantics_preparation.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
(2 hunks)wren-ai-service/tests/data/pairs.json
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py
(2 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py
(1 hunks)wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
(2 hunks)wren-ai-service/tools/config/config.example.yaml
(1 hunks)wren-ai-service/tools/config/config.full.yaml
(1 hunks)wren-ai-service/tools/mdl_to_str.py
(1 hunks)
💤 Files with no reviewable changes (1)
- wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
🚧 Files skipped from review as they are similar to previous changes (20)
- wren-ai-service/src/config.py
- deployment/kustomizations/base/deploy-wren-ai-service.yaml
- wren-ai-service/src/pipelines/generation/utils/sql.py
- wren-ai-service/Justfile
- wren-ai-service/tests/data/pairs.json
- wren-ai-service/tools/config/config.example.yaml
- wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
- docker/config.example.yaml
- deployment/kustomizations/base/cm.yaml
- wren-ai-service/src/web/v1/services/semantics_preparation.py
- docker/docker-compose.yaml
- wren-ai-service/tools/config/config.full.yaml
- docker/docker-compose-dev.yaml
- wren-ai-service/tools/mdl_to_str.py
- wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
- wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
- wren-ai-service/src/pipelines/generation/sql_generation.py
- wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
- wren-ai-service/src/pipelines/indexing/init.py
- wren-ai-service/src/pipelines/indexing/sql_pairs.py
🔇 Additional comments (4)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)
5-5
: LGTM! Component renaming reflects the functionality better.The change from "preparation" to "indexing" better represents the actual functionality of storing and managing SQL pairs.
Also applies to: 13-13
29-32
: 🛠️ Refactor suggestionEnhance test robustness for SQL pair deletion.
The test uses hardcoded SQL pair IDs without verifying their existence in pairs.json. Consider:
- Reading actual IDs from the test data file
- Adding assertions to verify the specific documents were deleted
Let's verify the content of pairs.json:
wren-ai-service/src/globals.py (2)
79-82
: LGTM! Clean integration of SQL pairs indexing.The addition of the SQL pairs pipeline with configuration from settings is well-structured and consistent with the overall changes.
227-228
: Consider renaming the preparation service for consistency.While the component has been updated to use indexing.SqlPairs, the service class name
SqlPairsPreparationService
still contains "preparation". Consider renaming it to maintain consistency with the new indexing terminology.- sql_pairs_preparation_service=SqlPairsPreparationService( + sql_pairs_indexing_service=SqlPairsIndexingService(Note: This change would require updating the service class name and its usages across the codebase.
Let's check for other occurrences:
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)
Line range hint
38-71
: Consider adding negative test cases.While the test covers multiple project IDs well, consider adding test cases for:
- Invalid MDL string format
- Non-existent project IDs
- Empty or malformed SQL pairs file
Line range hint
1-71
: Consider separating test data setup.To improve test maintainability, consider:
- Moving the test data setup to a fixture
- Creating a dedicated test data factory for generating MDL strings
- Using parametrized tests for different SQL pair scenarios
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
wren-ai-service/src/pipelines/indexing/sql_pairs.py
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- wren-ai-service/src/pipelines/indexing/sql_pairs.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: pytest
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (3)
5-5
: LGTM! Import and component changes align with the pipeline refactoring.The changes from
SqlPairsPreparation
toSqlPairs
and the component renaming to "sql_pairs_indexing" maintain consistency with the broader architectural changes.Also applies to: 13-13
10-10
: Function name updated for consistency.The function name has been updated from "preparation" to "indexing" as previously suggested.
20-25
: Verify the test data file existence and format.The test relies on
tests/data/pairs.json
. Let's verify its existence and format.✅ Verification successful
Test data file exists and is properly formatted
The file exists at "wren-ai-service/tests/data/pairs.json" with valid JSON structure containing appropriate test data for SQL pairs testing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check if the test data file exists and validate its JSON structure if [ -f "tests/data/pairs.json" ]; then echo "File exists. Validating JSON structure..." jq '.' tests/data/pairs.json else echo "Error: tests/data/pairs.json not found" exit 1 fiLength of output: 130
Script:
#!/bin/bash # Search for any pairs.json files in the repository echo "Searching for pairs.json files:" fd pairs.json echo -e "\nSearching for any JSON files in tests directory:" fd -e json . tests/ echo -e "\nSearching for other test files that might use pairs.json:" rg -l "pairs.json" tests/Length of output: 559
Script:
#!/bin/bash # Validate the JSON structure of the found file echo "Validating JSON structure of the found file:" jq '.' wren-ai-service/tests/data/pairs.jsonLength of output: 372
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
13f4ff0
to
81f13bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)
38-58
: Update documentation strings in test functionsThe docstrings or comments within the test functions still refer to the old
sql_pairs_preparation
naming. Consider updating any internal documentation to match the new naming convention for clarity.Apply this diff to update the comments:
# Existing comments or docstrings, if any
Note: Since the provided code does not include comments, ensure that any existing comments are updated accordingly.
deployment/kustomizations/base/cm.yaml (1)
184-184
: Add newline at end of file.Add a newline character at the end of the file to comply with YAML best practices.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 184-184: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
deployment/kustomizations/base/cm.yaml
(2 hunks)docker/config.example.yaml
(1 hunks)wren-ai-service/Justfile
(1 hunks)wren-ai-service/src/config.py
(1 hunks)wren-ai-service/src/globals.py
(2 hunks)wren-ai-service/src/pipelines/generation/sql_generation.py
(1 hunks)wren-ai-service/src/pipelines/generation/utils/sql.py
(1 hunks)wren-ai-service/src/pipelines/indexing/__init__.py
(2 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
(1 hunks)wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
(0 hunks)wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
(1 hunks)wren-ai-service/src/web/v1/services/semantics_preparation.py
(1 hunks)wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
(2 hunks)wren-ai-service/tests/data/pairs.json
(1 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py
(2 hunks)wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py
(1 hunks)wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
(2 hunks)wren-ai-service/tools/config/config.example.yaml
(1 hunks)wren-ai-service/tools/config/config.full.yaml
(1 hunks)wren-ai-service/tools/mdl_to_str.py
(1 hunks)
💤 Files with no reviewable changes (1)
- wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
🚧 Files skipped from review as they are similar to previous changes (15)
- wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
- docker/config.example.yaml
- wren-ai-service/tools/config/config.example.yaml
- wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
- wren-ai-service/src/web/v1/services/semantics_preparation.py
- wren-ai-service/Justfile
- wren-ai-service/tests/data/pairs.json
- wren-ai-service/tools/mdl_to_str.py
- wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
- wren-ai-service/src/pipelines/generation/utils/sql.py
- wren-ai-service/tools/config/config.full.yaml
- wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
- wren-ai-service/src/pipelines/generation/sql_generation.py
- wren-ai-service/src/config.py
- wren-ai-service/src/pipelines/indexing/init.py
🧰 Additional context used
🪛 yamllint (1.35.1)
deployment/kustomizations/base/cm.yaml
[error] 184-184: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: pytest
- GitHub Check: Analyze (go)
🔇 Additional comments (7)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)
54-57
:⚠️ Potential issueAdd exception handling for JSON parsing errors in
boilerplates
functionThe function
boilerplates
parsesmdl_str
without handling potential JSON decoding exceptions. Ifmdl_str
contains invalid JSON,orjson.loads
will raise an exception, which could cause the pipeline to fail unexpectedly.Apply this diff to handle JSON parsing errors gracefully:
def boilerplates( mdl_str: str, ) -> Set[str]: + try: mdl = orjson.loads(mdl_str) + except orjson.JSONDecodeError as e: + logger.error(f"Error parsing MDL JSON string: {e}") + return set()Likely invalid or redundant comment.
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)
20-25
: Update test to match newSqlPairs
initialization parametersThe instantiation of
SqlPairs
now includessql_pairs_path
, but theproject_id
parameter is passed during therun
method call. Ensure that the test accurately reflects the new initialization and method signatures.
29-32
: Verify SQL pair IDs correspond to the correctproject_id
In the deletion test, SQL pair IDs
["1", "2"]
are used withid="fake-id-2"
, but the documents were added withproject_id="fake-id"
. Ensure that the deletion call uses the correctproject_id
to accurately test the deletion functionality.Consider updating the deletion call to match the
project_id
used during insertion:await sql_pairs_deletion.run( - id="fake-id-2", + id="fake-id", sql_pair_ids=["1", "2"] )wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)
10-25
: Ensure consistent naming of test functionsThe test function name has been updated to
test_sql_pairs_indexing_saving_to_document_store
, which improves consistency with the newSqlPairs
class. Good job on updating the naming to reflect the changes.deployment/kustomizations/base/cm.yaml (1)
155-157
: LGTM! SQL pairs pipeline configuration is well-structured.The new pipeline components are properly configured with appropriate dependencies, and the removal of the LLM dependency from
sql_pairs_indexing
aligns with the PR objective of simplifying the pipeline.wren-ai-service/src/globals.py (2)
79-82
: LGTM! SQL pairs integration is properly configured.The new SQL pairs pipeline is well-integrated into the semantics preparation service with proper dependency injection and configuration.
227-228
: Verify sql_pairs_path configuration.Ensure that the
sql_pairs_path
setting is properly configured in the environment and the path exists in the deployment.
return [ | ||
SqlPair( | ||
id=pair.get("id"), | ||
question=pair.get("question"), | ||
sql=pair.get("sql"), | ||
) | ||
for boilerplate in boilerplates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure id
field is not missing in SqlPair
instances
In the sql_pairs
function, the id
field is retrieved using pair.get("id")
, which may return None
if the key is missing. Since SqlPair
requires id
to be a string, a missing id
could lead to validation errors or unintended behavior.
Apply this diff to enforce the presence of the id
field:
def sql_pairs(
boilerplates: Set[str],
external_pairs: Dict[str, Any],
) -> List[SqlPair]:
return [
SqlPair(
- id=pair.get("id"),
+ id=pair["id"],
question=pair.get("question"),
sql=pair.get("sql"),
)
for boilerplate in boilerplates
if boilerplate in external_pairs
for pair in external_pairs[boilerplate]
]
Alternatively, provide a default id
or handle cases where id
might be missing to prevent potential errors.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
return [ | |
SqlPair( | |
id=pair.get("id"), | |
question=pair.get("question"), | |
sql=pair.get("sql"), | |
) | |
for boilerplate in boilerplates | |
return [ | |
SqlPair( | |
id=pair["id"], | |
question=pair.get("question"), | |
sql=pair.get("sql"), | |
) | |
for boilerplate in boilerplates |
Changes
Pipeline Simplification
Configuration Changes
SQL_PAIRS_PATH
environment variablepairs.json
configuration file supportSample Management
Testing
Related Changes
Summary by CodeRabbit
Release Notes
New Features
Changes
sql_pairs_preparation
pipeline tosql_pairs_indexing
Improvements
Testing